Abstract — The problem of distributed learning and channel
access is considered in a cognitive network with multiple secondary
users. The availability statistics of the channels are
initially  unknown  to  the  secondary  users  and  are  estimated  using 
sensing  decisions.  There  is  no  explicit  information  exchange  or 
prior  agreement  among  the  secondary  users.  We  propose  policies 
for  distributed  learning  and  access  which  achieve  order-optimal 
cognitive  system  throughput  (number  of  successful  secondary 
transmissions)  under  self  play,  i.e.,  when  implemented  at  all  the 
secondary  users.  Equivalently,  our  policies  minimize  the  regret 
in  distributed  learning  and  access.  We  first  consider  the  scenario 
when  the  number  of  secondary  users  is  known  to  the  policy, 
and  prove  that  the  total  regret  is  logarithmic  in  the  number  of 
transmission  slots.  Our  distributed  learning  and  access  policy 
achieves  order-optimal  regret  by  comparing  to  an  asymptotic 
lower  bound  for  regret  under  any  uniformly-good  learning  and 
access policy. We then consider the case when the number of
secondary users is fixed but unknown, and is estimated through
feedback. We propose a policy in this scenario whose asymptotic
sum regret grows slightly faster than logarithmic in the
number of transmission slots.

Index  Terms — Cognitive  medium  access  control,  multi-armed 
bandits,  distributed  algorithms,  logarithmic  regret. 


I.  Introduction 

There has been extensive research on cognitive radio networks
in the past decade to resolve many challenges not
encountered previously in traditional communication networks
(see 13). One of the main challenges is to achieve coexistence
of heterogeneous users accessing the same part of
the  spectrum.  In  a  typical  cognitive  network,  there  are  two 
classes  of  transmitting  users,  viz.,  the  primary  users  who  have 
priority  in  accessing  the  spectrum  and  the  secondary  users  who 
opportunistically  transmit  when  the  primary  user  is  idle.  The 
secondary  users  are  cognitive  and  can  sense  the  spectrum  to 
detect  the  presence  of  a  primary  transmission.  However,  due 
to  resource  and  hardware  constraints,  they  can  sense  only  a 
part  of  the  spectrum  at  any  given  time. 

We consider a slotted cognitive system where each secondary
user can sense and access only one orthogonal channel
in each transmission slot (see Fig. 1). Under sensing constraints,
it is thus beneficial for the secondary users to select
channels  with  higher  mean  availability,  i.e.,  channels  which  are 
less  likely  to  be  occupied  by  the  primary  users.  However,  in 
practice,  the  channel  availability  statistics  are  a  priori  unknown 
to  the  secondary  users. 

Since  the  secondary  users  are  required  to  sense  the  medium 
before  transmission,  can  these  sensing  decisions  be  used  to 
learn  the  channel  availability  statistics?  If  so,  using  these 
estimated  channel  availabilities,  can  we  design  channel  access 
rules  which  maximize  the  transmission  throughput?  Designing 
provably  efficient  algorithms  to  accomplish  the  above  goals 
forms  the  focus  of  our  paper.  Such  algorithms  need  to  be 
efficient,  both  in  terms  of  learning  and  channel  access. 

For  any  learning  algorithm,  there  are  two  important  per¬ 
formance  criteria:  convergence  and  regret  bounds  (3.  In  the 
above  context,  we  require  the  estimates  to  converge  to  the 
correct  channel  availability  statistics  as  the  number  of  available 
sensing  decisions  goes  to  infinity.  A  stronger  criterion  is  the 
regret  of  a  learning  algorithm,  which  measures  the  speed 
of  convergence.  In  our  context,  the  regret  is  the  loss  in 
secondary  throughput  due  to  learning  compared  with  knowing 
the  channel  statistics  perfectly.  Hence,  it  is  desirable  for  the 
learning  algorithms  to  have  small  regret.  The  regret  is  a  finer 
measure  of  performance  of  a  learning  algorithm  than  the  time- 
averaged  throughput  since  a  sub-linear  regret  (with  respect  to 
time)  implies  optimal  average  throughput. 

Additionally,  we  consider  a  distributed  framework  where 
there  is  no  information  exchange  or  prior  agreement  among 
the  secondary  users.  This  introduces  additional  challenges: 
it  results  in  loss  of  throughput  due  to  collisions  among  the 
secondary  users,  and  there  is  now  competition  among  the 
secondary  users  since  they  all  tend  to  access  channels  with 
higher  availabilities.  It  is  imperative  for  the  channel  access 
policies  to  overcome  the  above  challenges.  Hence,  a  distributed 
learning  and  access  policy  experiences  regret  both  due  to 
learning  of  the  unknown  channel  availabilities  as  well  as  due 
to  collisions  under  distributed  access. 

A.  Our  Contributions 

The main contributions of this paper are twofold. First,
we  propose  two  distributed  learning  and  access  policies  for 
multiple  secondary  users  in  a  cognitive  network.  Second,  we 
provide  performance  guarantees  for  these  policies  in  terms  of 
regret. Overall, we prove that one of our proposed algorithms
achieves order-optimal regret and the other achieves nearly
order-optimal regret, where the order is in terms of the number
of transmission slots.

Fig. 1. Cognitive radio network with U = 4 secondary users and C = 5
channels. A secondary user is not allowed to transmit if the accessed channel
is occupied by a primary user. If more than one secondary user transmits in
the same free channel, then all the transmissions are unsuccessful.

The first policy we propose assumes that the total number
of secondary users in the system is known, while our second
policy relaxes this requirement. Our second policy also
incorporates estimation of the number of secondary users, in
addition to learning of the channel availabilities and designing
distributed access rules. We provide bounds on the total regret
experienced by the secondary users under self play, i.e., when
implemented at all the secondary users. For the first policy, we
prove that the regret is logarithmic, i.e., $O(\log n)$, where $n$ is
the number of transmission slots. For the second policy, the regret
grows slightly faster than logarithmic, i.e., $O(f(n) \log n)$,
where we can choose any function $f(n)$ satisfying $f(n) \to \infty$
as $n \to \infty$. Hence, we provide performance guarantees for the
proposed distributed learning and access policies.

A  lower  bound  on  regret  under  any  uniformly-good  dis¬ 
tributed  learning  policy  has  been  derived  in  |0j,  which  is  also 
logarithmic  in  the  number  of  transmission  slots.  Thus,  our  first 
policy  (which  requires  knowledge  of  the  number  of  secondary 
users)  achieves  order-optimal  regret.  The  effects  of  the  number 
of  secondary  users  and  the  number  of  channels  on  regret  are 
also  explicitly  characterized  and  verified  via  simulations. 

To the best of our knowledge, the exploration-exploitation
tradeoff for learning, combined with the cooperation-competition
tradeoffs among multiple users for distributed
medium access, has not been sufficiently examined in the
literature before (see Section I-B for a discussion). Our
analysis  in  this  paper  provides  important  engineering  insights 
towards  dealing  with  learning,  competition,  and  cooperation 
in  practical  cognitive  systems. 

Remark:  We  note  some  of  the  shortcomings  of  our  approach. 
The i.i.d. model¹ for primary transmissions is indeed idealistic
and  in  practice,  a  Markovian  model  may  be  more  appropriate 
E3,  @.  However,  the  i.i.d.  model  is  a  good  approximation  if 
the  time  slots  for  transmissions  are  long  and/or  the  primary 
traffic  is  highly  bursty.  Moreover,  the  i.i.d.  model  is  not  crucial 
towards  deriving  regret  bounds  for  our  proposed  schemes. 

1 By i.i.d. primary transmission model, we do not mean the presence of
a  single  primary  user,  but  rather,  this  model  is  used  to  capture  the  overall 
statistical  behavior  of  all  the  primary  users  in  the  system. 


Extensions  of  the  classical  multi-armed  bandit  problem  to 
a Markovian model are considered in [7]. In principle, our
results  on  distributed  learning  and  access  can  be  similarly 
extended  to  a  Markovian  channel  model  but  this  entails  more 
complex  estimators  and  rules  for  evaluating  the  exploration- 
exploitation  tradeoffs  of  different  channels  and  is  a  topic  of 
interest  for  future  investigation. 


B.  Related  Work 

Several  results  on  the  multi-armed  bandit  problem  will  be 
used  and  generalized  to  study  our  problem.  Detailed  discussion 
on multi-armed bandits can be found in [8]-[11]. Cognitive
medium  access  is  a  topic  of  extensive  research;  see  in 
for  an  overview.  The  connection  between  cognitive  medium 
access and the multi-armed bandit problem is explored in [13],
where  a  restless  bandit  formulation  is  employed.  Under  this 
formulation,  indexability  is  established,  the  Whittle’s  index 
for  channel  selection  is  obtained  in  closed-form,  and  the 
equivalence  between  the  myopic  policy  and  the  Whittle’s 
index  is  established.  However,  this  work  assumes  known 
channel  availability  statistics  and  does  not  consider  competing 
secondary users. The work in [14] considers allocation of two
users  to  two  channels  under  Markovian  channel  model  using 
a  partially  observable  Markov  decision  process  (POMDP) 
framework.  The  use  of  collision  feedback  information  for 
learning,  and  spatial  heterogeneity  in  spectrum  opportunities 
were  investigated.  However,  the  difference  from  our  work 
is that [14] assumes that the availability statistics (transition
probabilities)  of  the  channels  are  known  to  the  secondary  users 
while  we  consider  learning  of  unknown  channel  statistics. 
The  works  in  HSS,  ED  consider  centralized  access  schemes 
in  contrast  to  distributed  access  here,  H3  considers  access 
through  information  exchange  and  studies  the  optimal  choice 
of  the  amount  of  information  to  be  exchanged  given  the 
cost  of  negotiation.  na  considers  access  under  Q-learning 
for  two  users  and  two  channels  where  users  can  sense  both 
the  channels  simultaneously.  The  work  in  m  discusses  a 
game-theoretic approach to cognitive medium access. In [20],
learning  in  congestion  games  through  multiplicative  updates 
is  considered  and  convergence  to  weakly-stable  equilibria 
(which  reduces  to  the  pure  Nash  equilibrium  for  almost  all 
games)  is  proven.  However,  the  work  assumes  fixed  costs  (or 
equivalently  rewards)  in  contrast  to  random  rewards  here,  and 
that  the  players  can  fully  observe  the  actions  of  other  players. 

Recently, the work in [21] considers combinatorial bandits,
where  a  more  general  model  of  different  (unknown)  channel 
availabilities  is  assumed  for  different  secondary  users,  and  a 
matching  algorithm  is  proposed  for  jointly  allocating  users 
to channels. The algorithm is guaranteed to have logarithmic
regret with respect to the number of transmission slots and
polynomial storage requirements. A decentralized implementation
of the algorithm is also proposed, but it still requires
information exchange and coordination among the users. In
contrast, we propose algorithms which remove this requirement,
albeit in a more restrictive setting.

In  our  recent  work  m,  we  first  formulated  the  problem 
of  decentralized  learning  and  access  for  multiple  secondary 
users. We considered two scenarios: one where there is initial
common  information  among  the  secondary  users  in  the  form  of 
pre-allocated  ranks,  and  the  other  where  no  such  information 
is  available.  In  this  paper,  we  analyze  the  distributed  policy  in 
detail  and  prove  that  it  has  logarithmic  regret.  In  addition,  we 
also  consider  the  case  when  the  number  of  secondary  users  is 
unknown,  and  provide  bounds  on  regret  in  this  scenario. 

Recently,  Liu  and  Zhao  0  proposed  a  family  of  distributed 
learning  and  access  policies  known  as  time-division  fair  share 
(TDFS),  and  proved  logarithmic  regret  for  these  policies.  They 
established  a  lower  bound  on  the  growth  rate  of  system  regret 
for a general class of uniformly-good decentralized policies.
The  TDFS  policies  in  B]  can  incorporate  any  order-optimal 
single-player  policy  while  our  work  here  is  based  on  the 
single-user policy proposed in [11]. Another difference is that
in  Bl,  the  users  orthogonalize  via  settling  at  different  offsets 
in  their  time-sharing  schedule,  while  in  our  work  here,  users 
orthogonalize  into  different  channels.  Moreover,  the  TDFS 
policies  ensure  that  each  player  achieves  the  same  time- 
average  reward  while  our  policies  here  achieve  probabilistic 
fairness, in the sense that the policies do not discriminate
between different users. In [22], the TDFS policies are extended
to incorporate imperfect sensing.

Organization & Suggested Reading: Section II deals with
the system model. Section III deals with the special cases of
a single secondary user and of multiple users with centralized
access, which can be directly solved using the classical results
on multi-armed bandits. In Section IV, we propose a distributed
learning and access policy with provably logarithmic regret
when the number of secondary users is known. Section V
considers the scenario when the number of secondary users is
unknown. Section VI provides a lower bound for distributed
learning. Section VII has simulation results for the proposed
schemes and Section VIII concludes the paper. Most of the
proofs are found in the Appendix.

Since Section III mostly deals with a recap of the classical
results on multi-armed bandits, we suggest that an experienced
reader directly jump to Section IV for the main results of this
paper.

II.  System  Model  &  Formulation 

Notation: For any two functions $f(n)$, $g(n)$, $f(n) = O(g(n))$
if there exists a constant $c$ such that $f(n) \leq c\,g(n)$
for all $n > n_0$ for a fixed $n_0 \in \mathbb{N}$. Similarly, $f(n) = \Omega(g(n))$
if there exists a constant $c'$ such that $f(n) \geq c' g(n)$ for all
$n > n_0$ for a fixed $n_0 \in \mathbb{N}$, and $f(n) = \Theta(g(n))$ if
$f(n) = \Omega(g(n))$ and $f(n) = O(g(n))$. Also, $f(n) = o(g(n))$ when
$f(n)/g(n) \to 0$ and $f(n) = \omega(g(n))$ when $f(n)/g(n) \to \infty$
as $n \to \infty$.

We refer to the $U$ highest entries in a vector $\mu$ as the $U$-best
channels and the rest as the $U$-worst channels. Let $\sigma(T; \mu)$
denote the index of the $T$-th highest entry in $\mu$. Alternatively,
we abbreviate $T^* := \sigma(T; \mu)$ for ease of notation. With abuse of
notation, let $D(\mu_1, \mu_2) := D(B(\mu_1); B(\mu_2))$ be the Kullback-Leibler
distance between the Bernoulli distributions $B(\mu_1)$ and
$B(\mu_2)$ [23] and let $\Delta(1, 2) := \mu_1 - \mu_2$.


A.  Sensing  &  Channel  Models 

Let $U \geq 1$ be the number of secondary users² and $C > U$
be the number³ of orthogonal channels available for slotted
transmissions with a fixed slot width. In each channel $i$ and slot
$k$, the primary user transmits i.i.d. with probability $1 - \mu_i > 0$.
In other words, let $W_i(k)$ denote the indicator variable if the
channel is free,

$$W_i(k) = \begin{cases} 0, & \text{channel } i \text{ occupied in slot } k, \\ 1, & \text{o.w.,} \end{cases}$$

and we assume that $W_i(k) \overset{\text{i.i.d.}}{\sim} B(\mu_i)$.

The mean availability vector $\mu$ consists of the mean availabilities
$\mu_i$ of all channels, i.e., $\mu := [\mu_1, \ldots, \mu_C]$, where
all $\mu_i \in (0, 1)$ and are distinct. $\mu$ is initially unknown
to all the secondary users and is learnt independently over
time using the past sensing decisions without any information
exchange among the users. We assume that sensing for primary
transmissions is perfect at all the users.

Let $T_{i,j}(k)$ denote the number of slots where channel $i$ is
sensed in $k$ slots by user $j$ (not necessarily being the sole
occupant of that channel). The sensing variables are obtained
as follows: at the beginning of each slot $k$, each secondary
user $j \in U$ selects exactly one channel $i \in C$ for sensing, and
hence, obtains the value of $W_i(k)$, indicating if the channel is
free. User $j$ then records all the sensing decisions of each
channel $i$ in a vector $X^k_{i,j} := [X_{i,j}(1), \ldots, X_{i,j}(T_{i,j}(k))]^T$.
Hence, $\bigcup_{i=1}^{C} X^k_{i,j}$ is the collection of sensed decisions for user
$j$ in $k$ slots for all the $C$ channels.

We assume the collision model under which if two or more
users transmit in the same channel then none of the transmissions
go through. At the end of each slot $k$, each user $j$ receives
acknowledgement $Z_j(k)$ on whether its transmission in the $k$-th
slot was received. Hence, in general, any policy employed by
user $j$ in the $(k+1)$-th slot, given by $\rho(\bigcup_{i=1}^{C} X^k_{i,j}, Z^k_j)$, is based
on all the previous sensing and feedback results.
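As a minimal illustration of the sensing and collision model above, the following Python sketch simulates a single slot. The function name `simulate_slot` and its arguments are hypothetical and chosen only for this example; the sketch assumes perfect sensing and that every user transmits whenever its sensed channel is free.

```python
import random

def simulate_slot(mu, choices, rng):
    """Simulate one slot of the collision model (illustrative sketch).

    mu      : mu[i] is the probability that channel i is free
    choices : choices[j] is the channel sensed by user j in this slot
    Returns (sensed, acks): sensed[j] is the value W_i(k) seen by user j,
    acks[j] is the acknowledgement Z_j(k).
    """
    # Draw the availability indicator W_i(k) ~ Bernoulli(mu_i) per channel.
    free = [1 if rng.random() < m else 0 for m in mu]
    # Count how many users selected each channel.
    load = [0] * len(mu)
    for c in choices:
        load[c] += 1
    sensed = [free[c] for c in choices]
    # A transmission succeeds only on a free channel used by a single user.
    acks = [1 if free[c] == 1 and load[c] == 1 else 0 for c in choices]
    return sensed, acks
```

For instance, `simulate_slot([0.9, 0.5, 0.2], [0, 0], random.Random(0))` models two users sensing the same channel, in which case neither transmission can be acknowledged.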

B.  Regret  of  a  Policy 

Under the above model, we are interested in designing
policies $\rho$ which maximize the expected number of successful
transmissions of the secondary users subject to the non-interference
constraint for the primary users. Let $S(n; \mu, U, \rho)$
be the expected total number of successful transmissions after
$n$ slots under $U$ secondary users and policy $\rho$.

In the ideal scenario where the availability statistics $\mu$ are
known a priori and a central agent orthogonally allocates the
secondary users to the $U$-best channels, the expected number
of successful transmissions after $n$ slots is given by

$$S^*(n; \mu, U) := n \sum_{j=1}^{U} \mu(j^*), \qquad (1)$$

where $\mu(j^*)$ is the $j$-th highest entry in $\mu$.

2  A  user  refers  to  a  secondary  user  unless  otherwise  mentioned. 

3 When  U  >  C,  learning  availability  statistics  is  less  crucial,  since  all 
channels  need  to  be  accessed  to  avoid  collisions.  In  this  case,  design  of 
medium  access  is  more  crucial. 
Algorithm 1 Single User Policy $\rho^1(g(n))$ in [10].

Input: $\{\bar{X}_{i,j}(n)\}_{i=1,\ldots,C}$: sample-mean availabilities after $n$
rounds, $g(i; n)$: statistic based on $\bar{X}_{i,j}(n)$,
$\sigma(T; g(n))$: index of the $T$-th highest entry in $g(n)$.
Init: Sense in each channel once, $n \leftarrow C$
Loop: $n \leftarrow n + 1$
Curr_Sel $\leftarrow$ channel corresponding to the highest entry in $g(n)$
for sensing. If free, transmit.


It is clear that $S^*(n; \mu, U) \geq S(n; \mu, U, \rho)$ for any policy
$\rho$ and finite $n$. We are interested in minimizing the regret in
learning and access, given by

$$R(n; \mu, U, \rho) := S^*(n; \mu, U) - S(n; \mu, U, \rho) \geq 0. \qquad (2)$$

We are interested in minimizing regret under any given $\mu \in (0, 1)^C$
with distinct elements.

By incorporating the collision channel model assumption
with no avoidance mechanism⁴, the expected throughput
under policy $\rho$ is given by

$$S(n; \mu, U, \rho) = \sum_{i=1}^{C} \sum_{j=1}^{U} \mu(i)\, E[V_{i,j}(n)],$$

where $V_{i,j}(n)$ is the number of times in $n$ slots where user
$j$ is the sole user to sense channel $i$. Hence, the regret in (2)
simplifies as

$$R(n; \rho) = \sum_{k=1}^{U} n\, \mu(k^*) - \sum_{i=1}^{C} \sum_{j=1}^{U} \mu(i)\, E[V_{i,j}(n)]. \qquad (3)$$
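The regret expression in (3) can be evaluated directly once the expected sole-occupancy counts are available. The sketch below is only illustrative: the function names and the nested-list layout `expected_sole_counts[i][j]`, standing for $E[V_{i,j}(n)]$, are assumptions of this example and not part of the original formulation.

```python
def ideal_throughput(mu, U, n):
    """S*(n; mu, U) in (1): n times the sum of the U largest availabilities."""
    return n * sum(sorted(mu, reverse=True)[:U])

def regret(mu, U, n, expected_sole_counts):
    """R(n; rho) in (3).

    expected_sole_counts[i][j] stands for E[V_{i,j}(n)], the expected number
    of slots in which user j is the sole user sensing channel i.
    """
    achieved = sum(mu[i] * sum(row) for i, row in enumerate(expected_sole_counts))
    return ideal_throughput(mu, U, n) - achieved
```

With $\mu = [0.9, 0.5, 0.2]$, $U = 2$ and $n = 100$, an orthogonal allocation to the two best channels in every slot gives zero regret, while users that always collide incur the full ideal throughput as regret.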

III.  Special  Cases  From  Known  Results 

We recap the bounds for the regret under the special cases
of a single secondary user ($U = 1$) and multiple users with
centralized learning and access by appealing to the classical
results on the multi-armed bandit process [8]-[10].


A.  Single  Secondary  User  (U  =  1) 

When there is only one secondary user, the problem of
finding policies with minimum regret reduces to that of a
multi-armed bandit process. Lai and Robbins [8] first analyzed
schemes for multi-armed bandits with asymptotic logarithmic
regret based on the upper confidence bounds on the unknown
channel availabilities. Since then, simpler schemes have been
proposed in [10], [11] which compute a statistic or an index for
each arm (channel), henceforth referred to as the g-statistic,
based only on its sample mean and the number of slots where
the particular arm is sensed. The arm with the highest index is
selected in each slot in these works. We summarize the policy
in Algorithm 1 and denote it $\rho^1(g(n))$, where $g(n)$ is the
vector of scores assigned to the channels after $n$ transmission
slots.

4The  effect  of  employing  CSMA-CA  is  not  considered  here  although  it 
can  be  shown  that  it  reduces  the  regret  and  hence,  the  bounds  we  derive  are 
applicable. 


The sample-mean based policy in [11, Thm. 1] proposes an
index for each channel $i$ and user $j$ at time $n$, given by

$$g_j^{\mathrm{MEAN}}(i; n) := \bar{X}_{i,j}(T_{i,j}(n)) + \sqrt{\frac{2 \log n}{T_{i,j}(n)}}, \qquad (4)$$

where $T_{i,j}(n)$ is the number of slots where user $j$ selects
channel $i$ for sensing and

$$\bar{X}_{i,j}(T_{i,j}(n)) := \sum_{k=1}^{T_{i,j}(n)} \frac{X_{i,j}(k)}{T_{i,j}(n)}$$

is the sample-mean availability of channel $i$, as sensed by user $j$.
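To make the index and the single-user policy of Algorithm 1 concrete, a minimal Python sketch follows. It assumes the index has the form of a sample mean plus $\sqrt{2 \log n / T}$; the names `g_mean` and `single_user_policy` and the simulated Bernoulli channels are choices made for this example only.

```python
import math
import random

def g_mean(sample_mean, times_sensed, n):
    """Index of the form in (4): sample mean plus sqrt(2 log n / T)."""
    return sample_mean + math.sqrt(2.0 * math.log(n) / times_sensed)

def single_user_policy(mu, horizon, seed=0):
    """Sketch of Algorithm 1 with the g_mean index; returns sensing counts."""
    rng = random.Random(seed)
    C = len(mu)
    counts = [0] * C      # T_i(n): slots in which channel i was sensed
    free_sum = [0] * C    # number of those slots in which it was free

    def sense(i):
        counts[i] += 1
        free_sum[i] += 1 if rng.random() < mu[i] else 0

    for i in range(C):    # Init: sense each channel once
        sense(i)
    for n in range(C + 1, horizon + 1):
        index = [g_mean(free_sum[i] / counts[i], counts[i], n) for i in range(C)]
        # Sense the channel with the highest index in this slot.
        sense(max(range(C), key=index.__getitem__))
    return counts
```

Running `single_user_policy([0.9, 0.2, 0.1], 2000)` returns the number of slots spent in each channel; under this sketch the bulk of the slots is expected to go to the channel with the highest availability.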

The statistic in (4) captures the exploration-exploitation
tradeoff between sensing the channel with the best predicted
availability to maximize immediate throughput and sensing
different channels to obtain improved estimates of their
availabilities. The sample-mean term in (4) corresponds to exploitation
while the other term involving $T_{i,j}(n)$ corresponds to
exploration since it penalizes channels which are not sensed
often. The normalization of the exploration term with $\log n$ in
(4) implies that the term is significant when $T_{i,j}(n)$ is much
smaller than $\log n$. On the other hand, if all the channels
are sensed $O(\log n)$ number of times, the exploration terms
become unimportant in the g-statistics of the channels and the
exploitation term dominates, thereby favoring sensing of the
channel with the highest sample mean.

The regret based on the above statistic in (4) is logarithmic
for any finite number of slots $n$ but does not have the optimal
scaling constant. The sample-mean based statistic in [10,
Example 5.7] leads to the optimal scaling constant for regret
and is given by

$$g_j^{\mathrm{OPT}}(i; n) := \bar{X}_{i,j}(T_{i,j}(n)) + \min[\ldots]. \qquad (5)$$

In this paper, we design policies based on the $g^{\mathrm{MEAN}}$ statistic
since it is simpler to analyze than the $g^{\mathrm{OPT}}$ statistic.

We now recap the results which show logarithmic regret in
learning the best channel. In this context, we define uniformly
good policies $\rho$ [8] as those with regret

$$R(n; \mu, U, \rho) = o(n^{\alpha}), \quad \forall \alpha > 0,\ \mu \in (0, 1)^C. \qquad (6)$$


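Theorem 1 (Logarithmic Regret for $U = 1$ Mil, HU):
For any uniformly good policy $\rho$ satisfying (6), the expected
time spent in any suboptimal channel $i \neq 1^*$ satisfies

$$\lim_{n \to \infty} P\left[ T_{i,1}(n) \geq \frac{(1 - \epsilon) \log n}{D(\mu_i, \mu_{1^*})} \right] = 1, \qquad (7)$$

where $1^*$ is the channel with the best availability. Hence, the
regret satisfies

$$\liminf_{n \to \infty} \frac{R(n; \mu, 1, \rho)}{\log n} \geq \sum_{i \in 1\text{-worst}} \frac{\Delta(1^*, i)}{D(\mu_i, \mu_{1^*})}. \qquad (8)$$

The regret under the $g^{\mathrm{OPT}}$ statistic in (5) achieves the above
bound:

$$\lim_{n \to \infty} \frac{R(n; \mu, 1, \rho^1(g^{\mathrm{OPT}}))}{\log n} = \sum_{i \in 1\text{-worst}} \frac{\Delta(1^*, i)}{D(\mu_i, \mu_{1^*})}. \qquad (9)$$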
Algorithm 2 Centralized Learning Policy $\rho^{\mathrm{CENT}}$ in [9].

Input: $X^n := \bigcup_{j=1}^{U} \bigcup_{i=1}^{C} X^n_{i,j}$: channel availability after $n$
slots, $g(n)$: statistic based on $X^n$,
$\sigma(T; g(n))$: index of the $T$-th highest entry in $g(n)$.
Init: Sense in each channel once, $n \leftarrow C$
Loop: $n \leftarrow n + 1$
Curr_Sel $\leftarrow$ channels with $U$-best entries in $g(n)$. If free,
transmit.


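The regret under the $g^{\mathrm{MEAN}}$ statistic in (4) satisfies

$$R(n; \mu, 1, \rho^1(g^{\mathrm{MEAN}})) \leq \sum_{i \neq 1^*} \Delta(1^*, i) \left[ \frac{8 \log n}{\Delta(1^*, i)^2} + 1 + \frac{\pi^2}{3} \right].$$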
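The single-user bounds above are easy to compare numerically. The following Python sketch assumes the asymptotic lower-bound constant is a sum of gaps divided by Bernoulli Kullback-Leibler distances, and that the finite-$n$ bound for the sample-mean index has the familiar $8 \log n / \Delta$ form; the function names are hypothetical and the availabilities are assumed to lie strictly inside $(0, 1)$.

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler distance D(B(p); B(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lower_bound_constant(mu):
    """Sum over suboptimal channels of Delta(1*, i) / D(mu_i, mu_{1*})."""
    best = max(mu)
    return sum((best - m) / kl_bernoulli(m, best) for m in mu if m != best)

def g_mean_upper_bound(mu, n):
    """Sum over suboptimal channels of Delta * (8 log n / Delta^2 + 1 + pi^2/3)."""
    best = max(mu)
    return sum((best - m) * (8 * math.log(n) / (best - m) ** 2
                             + 1 + math.pi ** 2 / 3)
               for m in mu if m != best)
```

Comparing `lower_bound_constant(mu) * math.log(n)` with `g_mean_upper_bound(mu, n)` for a given `mu` gives a feel for the gap in the scaling constant that the text attributes to the simpler sample-mean index.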


B.  Centralized  Learning  &  Access  for  Multiple  Users 

We now consider multiple secondary users under centralized
access policies where there is joint learning and access by a
central agent on behalf of all the $U$ users. Here, to minimize
the sum regret, the centralized policy allocates the $U$ users to
orthogonal channels to avoid collisions. Let $\rho^{\mathrm{CENT}}(X^k)$, with
$X^k := \bigcup_{j=1}^{U} \bigcup_{i=1}^{C} X^k_{i,j}$, denote a centralized policy based
on the sensing variables of all the users. The policy under
centralized learning is a simple generalization of the single-user
policy and is given in Algorithm 2. We now recap the
results of [9].

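Theorem 2 (Regret Under Centralized Policy $\rho^{\mathrm{CENT}}$):
For any uniformly good centralized policy $\rho^{\mathrm{CENT}}$ satisfying
(6), the expected time spent in a $U$-worst channel $i$ satisfies

$$\lim_{n \to \infty} P\left[ \sum_{j=1}^{U} T_{i,j}(n) \geq \frac{(1 - \epsilon) \log n}{D(\mu_i, \mu_{U^*})} \right] = 1, \qquad (10)$$

where $U^*$ is the channel with the $U$-th best availability. Hence,
the regret satisfies

$$\liminf_{n \to \infty} \frac{R(n; \mu, U, \rho^{\mathrm{CENT}})}{\log n} \geq \sum_{i \in U\text{-worst}} \frac{\Delta(U^*, i)}{D(\mu_i, \mu_{U^*})}. \qquad (11)$$

The scheme in Algorithm 2 based on $g^{\mathrm{OPT}}$ achieves the above
bound:

$$\lim_{n \to \infty} \frac{R(n; \mu, U, \rho^{\mathrm{CENT}}(g^{\mathrm{OPT}}))}{\log n} = \sum_{i \in U\text{-worst}} \frac{\Delta(U^*, i)}{D(\mu_i, \mu_{U^*})}. \qquad (12)$$

The scheme in Algorithm 2 based on $g^{\mathrm{MEAN}}$ satisfies, for
any $n > 0$,

$$R(n; \mu, U, \rho^{\mathrm{CENT}}(g^{\mathrm{MEAN}})) \leq \sum_{m=1}^{U} \sum_{i \in U\text{-worst}} \frac{\Delta(m^*, i)}{U} \sum_{k=1}^{U} \left[ \frac{8 \log n}{\Delta(k^*, i)^2} + 1 + \frac{\pi^2}{3} \right]. \qquad (13)$$

Proof: See Appendix A. □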
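The selection step of the centralized policy relies only on the ranking notation $\sigma(T; g)$. As a small illustration, the Python sketch below computes the index of the $T$-th highest entry and the set of $U$-best entries of a score vector; the helper names `sigma` and `u_best` are hypothetical and ties are broken by channel index, which the text does not specify.

```python
def sigma(T, g):
    """Index of the T-th highest entry of g (T is 1-based, as in the text)."""
    order = sorted(range(len(g)), key=g.__getitem__, reverse=True)
    return order[T - 1]

def u_best(g, U):
    """Indices of the U highest entries of g: sigma(1; g), ..., sigma(U; g)."""
    return [sigma(T, g) for T in range(1, U + 1)]
```

A central agent following Algorithm 2 would, under this sketch, assign its $U$ users to the channels returned by `u_best(g, U)` in each slot.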


IV.  Main  Results 

Armed  with  the  classical  results  on  multi-armed  bandits,  we 
now  design  distributed  learning  and  allocation  policies. 


A.  Preliminaries:  Bounds  on  Regret 

We first provide simple bounds on the regret in (3) for any
distributed learning and access policy $\rho$.

Proposition 1 (Lower and Upper Bounds on Regret): The
regret under any distributed policy $\rho$ satisfies

$$R(n; \rho) \geq \sum_{j=1}^{U} \sum_{i \in U\text{-worst}} \Delta(U^*, i)\, E[T_{i,j}(n)], \qquad (14)$$

$$R(n; \rho) \leq \mu(1^*) \left[ \sum_{j=1}^{U} \sum_{i \in U\text{-worst}} E[T_{i,j}(n)] + E[M(n)] \right], \qquad (15)$$

where $T_{i,j}(n)$ is the number of slots where user $j$ selects
channel $i$ for sensing, $M(n)$ is the number of collisions faced
by the users in the $U$-best channels in $n$ slots, $\Delta(i, j) = \mu(i) - \mu(j)$
and $\mu(1^*)$ is the highest mean availability.

Proof: See Appendix B. □

In the subsequent sections, we propose distributed learning
and access policies and provide regret guarantees for the
policies using the upper bound in (15). The lower bound in
(14) can be used to derive a lower bound on regret for any
uniformly-good policy.

The first term in (15) represents the lost transmission
opportunities due to selection of $U$-worst channels (with
lower mean availabilities), while the second term represents
performance loss due to collisions among the users in the
$U$-best channels. The first term in (15) decouples among the
different users and can be analyzed solely through the marginal
distributions of the g-statistics at the users. This, in turn, can
be analyzed by manipulating the classical results on multi-armed
bandits m, ini. On the other hand, the second term
in (15), involving collisions in the $U$-best channels, requires
the joint distribution of the g-statistics at different users, which
are correlated variables. This is intractable to analyze directly
and we develop techniques to bound this term.


B. $\rho^{\mathrm{RAND}}$: Distributed Learning and Access

We present the $\rho^{\mathrm{RAND}}$ policy in Algorithm 3. Before
describing this policy, we make some simple observations. If
each user implemented the single-user policy in Algorithm 1,
then it would result in collisions, since all the users target
the best channel. When there are multiple users and there
is no direct communication among them, the users need to
randomize channel access in order to avoid collisions. At
the same time, accessing the $U$-worst channels needs to be
avoided since they contribute to regret. Hence, users can avoid
collisions by randomizing access over the $U$-best channels,
based on their estimates of the channel ranks. However, if the
users randomize in every slot, there is a nonzero probability of
collisions in every slot and this results in a linear growth of
regret with the number of time slots. Hence, the users need
to converge to a collision-free configuration to ensure that the
regret is logarithmic.
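The linear growth under per-slot randomization can be quantified with a short calculation. Assuming that in every slot each of the $U$ users independently picks one of the $U$-best channels uniformly at random, the selections are all distinct with probability $U!/U^U$, so the expected number of slots in which at least two users pick the same channel grows linearly in $n$. The function names below are hypothetical.

```python
import math

def prob_collision_free(U):
    """P(all U users pick distinct channels) under uniform picks among U."""
    return math.factorial(U) / U ** U

def expected_collision_slots(U, n):
    """Expected number of slots (out of n) with at least one shared channel."""
    return n * (1.0 - prob_collision_free(U))
```

For $U = 4$, `prob_collision_free(4)` evaluates to $24/256$, so more than 90% of the slots would see some users sharing a channel if the users never stopped randomizing.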

In Algorithm 3, there is adaptive randomization based
on feedback regarding the previous transmission. Each user
randomizes only if there is a collision in the previous slot;
otherwise, the previously generated random rank for the user
is retained. The estimation for the channel ranks is through
the g-statistic, on lines similar to the single-user case.

Algorithm 3 Policy $\rho^{\mathrm{RAND}}(U, C, g_j(n))$ for each user $j$ under
$U$ users, $C$ channels and statistic $g_j(n)$.

Input: $\{\bar{X}_{i,j}(n)\}_{i=1,\ldots,C}$: sample-mean availabilities at
user $j$ after $n$ rounds, $g_j(i; n)$: statistic based on $\bar{X}_{i,j}(n)$,
$\sigma(T; g_j(n))$: index of the $T$-th highest entry in $g_j(n)$,
$\zeta_j(i; n)$: indicator of collision at the $n$-th slot at channel $i$.
Init: Sense in each channel once, $n \leftarrow C$, Curr_Rank $\leftarrow 1$,
$\zeta_j(i; m) \leftarrow 0$
Loop: $n \leftarrow n + 1$
  if $\zeta_j(\mathrm{Curr\_Sel}; n - 1) = 1$ then
    Draw a new Curr_Rank $\sim \mathrm{Unif}(U)$
  end if
  Select channel for sensing. If free, transmit.
  Curr_Sel $\leftarrow \sigma(\mathrm{Curr\_Rank}; g_j(n))$.
  If collision, $\zeta_j(\mathrm{Curr\_Sel}; n) \leftarrow 1$, Else 0.
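A minimal self-play simulation of the adaptive randomization described above is sketched below in Python. It is an illustration under stated assumptions rather than the authors' implementation: it uses a sample-mean index with a $\sqrt{2 \log n / T}$ exploration term, draws independent initial samples for each user, ignores transmissions during the initial sensing round, and treats a collision as observed only when the shared channel is free. The function name `rho_rand` is hypothetical.

```python
import math
import random

def rho_rand(mu, U, horizon, seed=0):
    """Self play of U users running a sketch of Algorithm 3.

    Returns (successes, counts); counts[j][i] plays the role of T_{i,j}(n).
    """
    rng = random.Random(seed)
    C = len(mu)
    # Init: every user senses each channel once (transmissions in these
    # C slots are ignored, a simplification of this sketch).
    counts = [[1] * C for _ in range(U)]
    free_sum = [[int(rng.random() < m) for m in mu] for _ in range(U)]
    rank = [0] * U              # Curr_Rank, 0-based, so 0 is the top rank
    collided = [False] * U      # collision indicator of the previous slot
    successes = 0

    for n in range(C + 1, horizon + 1):
        free = [rng.random() < m for m in mu]   # W_i(n) for every channel
        choice = []
        for j in range(U):
            if collided[j]:
                rank[j] = rng.randrange(U)      # redraw the rank uniformly
            g = [free_sum[j][i] / counts[j][i]
                 + math.sqrt(2.0 * math.log(n) / counts[j][i])
                 for i in range(C)]
            order = sorted(range(C), key=g.__getitem__, reverse=True)
            choice.append(order[rank[j]])       # sigma(Curr_Rank; g_j(n))
        for j, c in enumerate(choice):
            counts[j][c] += 1
            free_sum[j][c] += int(free[c])
            shared = choice.count(c) > 1
            # A collision is only observed if the channel is free and shared.
            collided[j] = free[c] and shared
            if free[c] and not shared:
                successes += 1
    return successes, counts
```

Calling `rho_rand([0.9, 0.8, 0.3, 0.2], 2, 3000)` returns the total number of successful transmissions together with the per-user sensing counts, which can be compared against the ideal throughput to estimate the regret empirically.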


C. Regret Bounds under $\rho^{\text{RAND}}$

It is easy to see that the $\rho^{\text{RAND}}$ policy ensures that the users are allocated orthogonally to the $U$-best channels as the number of transmission slots goes to infinity. The regret bounds on $\rho^{\text{RAND}}$ are however not immediately clear and we provide guarantees below.

We first provide a logarithmic upper bound$^5$ on the number of slots spent by each user in any $U$-worst channel. Hence, the first term in the bound on regret in (15) is also logarithmic.

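Lemma 1 (Time Spent in $U$-worst Channels): Under the $\rho^{\text{RAND}}$ scheme in Algorithm 3, the expected time spent by any user $j = 1, \ldots, U$ in any $i \in U$-worst channel satisfies

$$\mathbb{E}[T_{i,j}(n)] \le \sum_{k=1}^{U} \left[ \frac{8 \log n}{\Delta(i, k^*)^2} + 1 + \frac{\pi^2}{3} \right]. \tag{16}$$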


Proof: The proof is on lines similar to the proof for Theorem 2 given in Appendix A. □
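To get a feel for the size of the bound in Lemma 1, the sketch below evaluates a sum of the form $\sum_{k=1}^{U} [8 \log n / \Delta(i,k^*)^2 + 1 + \pi^2/3]$ numerically. It assumes that $\Delta(i, k^*)$ is the gap between the mean availability of the $k$-th best channel and that of channel $i$; since the gap is squared, its sign convention does not matter. The function name and the example means are our choices.

```python
import math

def lemma1_bound(mu, num_users, channel, n):
    """Evaluate sum_k [8 log n / gap_k^2 + 1 + pi^2/3] over the U-best
    channels, for a U-worst channel with index `channel` in `mu`."""
    best = sorted(mu, reverse=True)[:num_users]   # means of the U-best channels
    if mu[channel] >= min(best):
        raise ValueError("channel must be one of the U-worst channels")
    total = 0.0
    for mu_k in best:
        gap = mu_k - mu[channel]                  # assumed meaning of Delta(i, k*)
        total += 8.0 * math.log(n) / gap ** 2 + 1.0 + math.pi ** 2 / 3.0
    return total

mu = [0.1 * k for k in range(1, 10)]              # 0.1, 0.2, ..., 0.9
print(lemma1_bound(mu, num_users=4, channel=0, n=2500))
print(lemma1_bound(mu, num_users=4, channel=4, n=2500))
```

The second call uses the channel with mean 0.5, which sits just below the $U$-best set, so the small gaps make the bound much larger than for the worst channel.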

We now focus on analyzing the number of collisions $M(n)$ in the $U$-best channels. We first give a result on the expected number of collisions in the ideal scenario where each user has perfect knowledge of the channel availability statistics $\mu$. In this case, the users attempt to reach an orthogonal (collision-free) configuration by uniformly randomizing over the $U$-best channels.

The stochastic process in this case is a finite-state Markov chain. A state in this Markov chain corresponds to a configuration of $U$ (identical) users in $U$ channels. The number of states in the Markov chain is the number of compositions of $U$, given by $\binom{2U-1}{U}$ [24, Thm. 5.1]. The orthogonal configuration corresponds to the absorbing state. For any other state in which every channel holds either more than one user or no user, the transition probability to any state of the Markov chain (including the self-transition probability) is uniform. For a state where certain channels have exactly one user, there are only transitions to states which consist of at least one user in each such channel, and the transition probabilities are uniform. Let $\Upsilon(U, U)$ denote the maximum time to absorption in the above Markov chain starting from any initial distribution. We have the following result.

$^5$ Note that the bound on $\mathbb{E}[T_{i,j}(n)]$ in (16) holds for user $j$ even if the other users are using a policy other than $\rho^{\text{RAND}}$. But on the other hand, to analyze the number of collisions $\mathbb{E}[M(n)]$ in (19), we need every user to implement $\rho^{\text{RAND}}$.

Lemma 2 (# of Collisions Under Perfect Knowledge): The expected number of collisions under the $\rho^{\text{RAND}}$ scheme in Algorithm 3, assuming that each user has perfect knowledge of the mean channel availabilities $\mu$, satisfies

$$\mathbb{E}[M(n); \rho^{\text{RAND}}(U, C, \mu)] \le U\,\mathbb{E}[\Upsilon(U, U)] \le U \binom{2U-1}{U}. \tag{17}$$

Proof: See Appendix C. □

The above result states that there is at most a finite number of expected collisions, bounded by $U\,\mathbb{E}[\Upsilon(U, U)]$, under perfect knowledge of $\mu$. In contrast, recall from the previous section that there are no collisions under perfect knowledge of $\mu$ in the presence of pre-allocated ranks. Hence, $U\,\mathbb{E}[\Upsilon(U, U)]$ represents a bound on the additional regret due to the lack of direct communication among the users to negotiate their ranks.
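The state count of the Markov chain can be checked by brute force for small $U$. The sketch below, which is ours and purely illustrative, enumerates occupancy vectors of $U$ identical users over $U$ channels and compares the count with the binomial coefficient $\binom{2U-1}{U}$.

```python
from itertools import product
from math import comb

def num_configurations(num_users: int) -> int:
    """Count occupancy vectors (n_1, ..., n_U) with n_1 + ... + n_U = U,
    i.e. the ways to place U identical users in U channels."""
    return sum(
        1
        for occ in product(range(num_users + 1), repeat=num_users)
        if sum(occ) == num_users
    )

for u in range(1, 6):
    assert num_configurations(u) == comb(2 * u - 1, u)
    print(u, comb(2 * u - 1, u))
```

The count grows quickly with $U$ (1, 3, 10, 35, 126 for $U = 1, \ldots, 5$), which gives a sense of how fast a bound based on the number of states grows with the number of users.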

We use the result of Lemma 2 for analyzing the number of collisions under distributed learning of the unknown availabilities $\mu$ as follows: if we show that the users are able to learn the correct order of the different channels with only logarithmic regret, then only an additional finite expected number of collisions occurs before reaching an orthogonal configuration.

Define $T'(n; \rho^{\text{RAND}})$ as the number of slots where any one of the top-$U$ estimated ranks of the channels at some user is wrong under the $\rho^{\text{RAND}}$ policy. Below we prove that its expected value is logarithmic in the number of transmissions.

Lemma 3 (Wrong Order of $g$-statistics): Under the $\rho^{\text{RAND}}$ scheme in Algorithm 3,

$$\mathbb{E}[T'(n; \rho^{\text{RAND}})] \le U \sum_{a=1}^{U} \sum_{b=a+1}^{C} \left[ \frac{8 \log n}{\Delta(a^*, b^*)^2} + 1 + \frac{\pi^2}{3} \right]. \tag{18}$$

Proof: See Appendix D. □

We now provide an upper bound on the number of collisions $M(n)$ in the $U$-best channels by incorporating the above result on $\mathbb{E}[T'(n)]$, the result on the average number of slots $\mathbb{E}[T_{i,j}(n)]$ spent in the $U$-worst channels in Lemma 1, and the average number of collisions $U\,\mathbb{E}[\Upsilon(U, U)]$ under perfect knowledge of $\mu$ in Lemma 2.

Theorem 3 (Logarithmic Number of Collisions Under $\rho^{\text{RAND}}$): The expected number of collisions in the $U$-best channels under the $\rho^{\text{RAND}}(U, C, g^{\text{MEAN}})$ scheme satisfies

$$\mathbb{E}[M(n)] \le U \left( \mathbb{E}[\Upsilon(U, U)] + 1 \right) \mathbb{E}[T'(n)]. \tag{19}$$

Hence, from (16), (18) and (17), $\mathbb{E}[M(n)] = O(\log n)$.

Proof: See Appendix E. □

Hence, there is only a logarithmic number of expected collisions before the users settle in the orthogonal channels. Combining this result with Lemma 1 that the number of slots spent in the $U$-worst channels is also logarithmic, we immediately have one of the main results of this paper that the sum regret under distributed learning and access is logarithmic.

Theorem 4 (Logarithmic Regret Under $\rho^{\text{RAND}}$): The policy $\rho^{\text{RAND}}(U, C, g^{\text{MEAN}})$ in Algorithm 3 has $\Theta(\log n)$ regret.

Proof: Substituting (19) and (16) in (15). □

Hence, we prove that distributed learning and channel access among multiple secondary users is possible with logarithmic regret without any explicit communication among the users. This implies that the number of lost opportunities for successful transmissions at all secondary users is only logarithmic in the number of transmissions, which is negligible when there is a large number of transmissions.

We have so far focused on designing schemes that maximize system or social throughput. We now briefly discuss the fairness for an individual user under $\rho^{\text{RAND}}$. The policy $\rho^{\text{RAND}}$ does not distinguish between the users: each user has equal probability of "settling" down in any one of the $U$-best channels, while experiencing only logarithmic regret in doing so. Simulations in Section VII (in Fig. 4) demonstrate this phenomenon.

V.  Distributed  Learning  and  Access  under 
Unknown  Number  of  Users 

We have so far assumed that the number of secondary users is known, and is required for the implementation of the $\rho^{\text{RAND}}$ policy. In practice, this entails initial announcement
from  each  of  the  secondary  users  to  indicate  their  presence  in 
the  cognitive  network.  However,  in  a  truly  distributed  setting 
without  any  information  exchange  among  the  users,  such  an 
announcement  may  not  be  possible. 

In  this  section,  we  consider  the  scenario,  where  the  number 
of  users  U  is  unknown  (but  fixed  throughout  the  duration  of 
transmissions  and  U  <  C,  the  number  of  channels).  In  this 
case,  the  policy  needs  to  estimate  the  number  of  secondary 
users  in  the  system,  in  addition  to  learning  the  channel 
availability  statistics  and  designing  channel  access  rules  based 
on collision feedback. Note that if the policy assumed the worst-case scenario that $U = C$, then the regret grows linearly since $U$-worst channels are selected a large number of times for sensing.

A. Description of $\rho^{\text{EST}}$ Policy

We now propose a policy $\rho^{\text{EST}}$ in Algorithm 4. This policy incorporates two functions in each transmission slot, viz., execution of the $\rho^{\text{RAND}}$ policy in Algorithm 3 based on the current estimate $\hat{U}$ of the number of users, and updating of the estimate $\hat{U}$ based on the number of collisions experienced by the user.

The updating is based on the idea that if there is under-estimation of $U$ at all the users ($\hat{U}_j < U$ at all the users $j$), collisions necessarily build up and the collision count serves as a criterion for incrementing $\hat{U}$. This is because after a long learning period, the users learn the true ranks of the channels, and target the same set of channels. However, when there is under-estimation, the number of users exceeds the number of channels targeted by the users. Hence, collisions among the users accumulate, and can be used as a test for incrementing $\hat{U}$.

Algorithm 4 Policy $\rho^{\text{EST}}(n, C, g_j(m), \xi)$ for each user $j$ under $n$ transmission slots (horizon length), $C$ channels, statistic $g_j(m)$ and threshold functions $\xi$.

1) Input: $\{\bar{X}_{i,j}(m)\}_{i=1,\ldots,C}$: Sample-mean availabilities at user $j$,
   $g_j(i; m)$: statistic based on $\bar{X}_{i,j}(m)$,
   $\sigma(T; g_j(m))$: index of $T$-th highest entry in $g_j(m)$,
   $\zeta_j(i; m)$: indicator of collision at $m$-th slot at channel $i$,
   $\hat{U}$: current estimate of the number of users,
   $n$: horizon (total number of slots for transmission).
2) Init: Sense each channel once, $m \leftarrow C$, Curr_Rank $\leftarrow 1$, $\hat{U} \leftarrow 1$, $\zeta_j(i; m) \leftarrow 0$ for all $i = 1, \ldots, C$.
3) Loop: $m \leftarrow m + 1$, stop when $m = n$.
4) If $\zeta_j(\text{Curr\_Sel}; m-1) = 1$ then
     Draw a new Curr_Rank $\sim \text{Unif}(\hat{U})$.
   end if
   Select channel for sensing. If free, transmit.
   Curr_Sel $\leftarrow \sigma(\text{Curr\_Rank}; g_j(m))$.
5) $\zeta_j(\text{Curr\_Sel}; m) \leftarrow 1$ if collision, $0$ otherwise.
6) If $\sum_{a=1}^{m} \sum_{k=1}^{\hat{U}} \zeta_j(\sigma(k; g_j(m)); a) > \xi(n; \hat{U})$ then
     $\hat{U} \leftarrow \hat{U} + 1$, $\zeta_j(i; a) \leftarrow 0$, $i = 1, \ldots, C$, $a = 1, \ldots, m$.
   end if

Denote the collision count used by the $\rho^{\text{EST}}$ policy as

$$\Phi_{k,j}(m) := \sum_{a=1}^{m} \sum_{b=1}^{k} \zeta_j(\sigma(b; g_j(m)); a), \tag{20}$$

which, for $k = \hat{U}_j$, is the total number of collisions experienced by user $j$ so far (till the $m$-th transmission slot) in the top-$\hat{U}_j$ channels, where the ranks of the channels are estimated using the $g$-statistics. The collision count is tested against a threshold $\xi(n; \hat{U}_j)$, which is a function of the horizon length and the current estimate $\hat{U}_j$. When the threshold is exceeded, $\hat{U}_j$ is incremented, and the collision samples collected so far are discarded by setting them to zero (line 6 in Algorithm 4).
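The threshold test on the collision count can be sketched as follows. This is a simplified illustration, not the paper's implementation: the class and function names are ours, the threshold $(\log n)^{1.5}$ for $k > 1$ is just one hypothetical function growing faster than $\log n$, and the sketch counts every collision the user experiences rather than tracking collisions per estimated channel rank.

```python
import math

def threshold(horizon: int, k: int) -> float:
    """Hypothetical threshold xi(n; k): 1 for k = 1, else (log n)^1.5."""
    return 1.0 if k == 1 else math.log(horizon) ** 1.5

class UserCountEstimator:
    """Tracks one user's estimate of the number of users from collisions."""

    def __init__(self, horizon: int):
        self.horizon = horizon
        self.estimate = 1          # initial estimate of the number of users
        self.collision_count = 0   # collisions since the last increment

    def observe(self, collided: bool) -> int:
        """Record one slot of collision feedback; return the estimate."""
        if collided:
            self.collision_count += 1
        # Strict inequality, as in line 6 of Algorithm 4.
        if self.collision_count > threshold(self.horizon, self.estimate):
            self.estimate += 1
            self.collision_count = 0   # discard collision samples so far
        return self.estimate

est = UserCountEstimator(horizon=2500)
for _ in range(30):
    est.observe(True)
print(est.estimate)
```

Resetting the count after each increment means every new value of the estimate is tested from scratch against its own threshold, which mirrors the discarding step of the algorithm.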

B. Regret Bounds under $\rho^{\text{EST}}$

We analyze regret bounds under the $\rho^{\text{EST}}$ policy, where the regret is defined in (3). Let the maximum threshold function for the number of consecutive collisions under the $\rho^{\text{EST}}$ policy be denoted by

$$\xi^*(n; U) := \max_{k=1,\ldots,U} \xi(n; k). \tag{21}$$

We prove that the $\rho^{\text{EST}}$ policy has $O(\xi^*(n; U))$ regret when $\xi^*(n; U) = \omega(\log n)$, where $n$ is the number of transmission slots.

The proof for the regret bound under the $\rho^{\text{EST}}$ policy consists of two main parts. First, we prove bounds on regret conditioned on the event that none of the users over-estimates $U$. Second, we show that the probability of over-estimation at any of the users goes to zero asymptotically. Combined together, we obtain the regret bound for the $\rho^{\text{EST}}$ policy.

$^6$ In this section, we assume that the users are aware of the horizon length $n$ for transmission. Note that this is not a limitation and can be extended to the case of unknown horizon length as follows: implement the algorithm by fixing horizon lengths to $n_0, 2n_0, 4n_0, \ldots$ for a fixed $n_0 \in \mathbb{N}$ and discarding estimates from previous stages.
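The doubling scheme mentioned in the footnote on unknown horizon length can be sketched as a schedule of stages. The helper below is ours and only illustrates the bookkeeping: each stage would run the policy afresh with the stage's nominal horizon as its input, discarding estimates from earlier stages.

```python
def doubling_stages(n0: int, total_slots: int):
    """Yield (start_slot, nominal_horizon) pairs with horizons
    n0, 2*n0, 4*n0, ... until total_slots slots are covered.
    The final stage may be interrupted before its horizon ends."""
    start, horizon = 0, n0
    while start < total_slots:
        yield start, horizon
        start += horizon
        horizon *= 2

print(list(doubling_stages(n0=100, total_slots=1000)))
```

With $n_0 = 100$ and 1000 slots in total, four stages are started, so the number of restarts grows only logarithmically in the total number of slots.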

Note that in order to have small regret, it is crucial that none of the users over-estimates $U$. This is because when there is over-estimation, there is a finite probability of selecting the $U$-worst channels even upon learning the true ranks of the channels. Note that regret is incurred whenever a $U$-worst channel is selected, since under perfect knowledge this channel would not be selected. Hence, under over-estimation, the regret grows linearly in the number of transmissions.

In a nutshell, under the $\rho^{\text{EST}}$ policy, the decision to increment the estimate $\hat{U}$ reduces to a hypothesis-testing problem with hypotheses $H_0$: number of users is less than or equal to the current estimate, and $H_1$: number of users is greater than the current estimate. In order to have a sub-linear regret, the false-alarm probability (deciding $H_1$ under $H_0$) needs to decay asymptotically. This is ensured by selecting appropriate thresholds $\xi(n)$ to test against the collision counts obtained through feedback.

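Conditional Regret: We now give the result for the first part. Define the "good event" $\mathcal{C}(n; U)$ that none of the users over-estimates $U$ under $\rho^{\text{EST}}$ as

$$\mathcal{C}(n; U) := \Big\{ \bigcap_{j=1}^{U} \hat{U}^{\text{EST}}_j(n) \le U \Big\}. \tag{22}$$

The regret conditioned on $\mathcal{C}(n; U)$, denoted by $R(n; \mu, U, \rho^{\text{EST}}) \,|\, \mathcal{C}(n; U)$, is given by

$$n \sum_{k=1}^{U} \mu(k^*) - \sum_{i=1}^{C} \sum_{j=1}^{U} \mu(i)\, \mathbb{E}[V_{i,j}(n) \,|\, \mathcal{C}(n; U)],$$

where $V_{i,j}(n)$ is the number of times that user $j$ is the sole user of channel $i$. Similarly, we have conditional expectations of $\mathbb{E}[T_{i,j}(n) \,|\, \mathcal{C}(n; U)]$ and of the number of collisions in $U$-best channels, given by $\mathbb{E}[M(n) \,|\, \mathcal{C}(n; U)]$. We now show that the regret conditioned on $\mathcal{C}(n; U)$ is $O(\max(\xi^*(n; U), \log n))$.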

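Lemma 4 (Conditional Regret): When all the $U$ secondary users implement the $\rho^{\text{EST}}$ policy, we have for all $i \in U$-worst channel and each user $j = 1, \ldots, U$,

$$\mathbb{E}[T_{i,j}(n) \,|\, \mathcal{C}(n; U)] \le \sum_{k=1}^{U} \left[ \frac{8 \log n}{\Delta(i, k^*)^2} + 1 + \frac{\pi^2}{3} \right]. \tag{23}$$

The conditional expectation of the number of collisions $M(n)$ in the $U$-best channels satisfies

$$\mathbb{E}[M(n) \,|\, \mathcal{C}(n; U)] \le U \sum_{k=1}^{U} \xi(n; k) \le U^2\, \xi^*(n; U). \tag{24}$$

From (15), we have $R(n) \,|\, \mathcal{C}(n; U)$ is $O(\max(\xi^*(n; U), \log n))$ for any $n \in \mathbb{N}$.

Proof: See Appendix F. □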

Probability of Over-estimation: We now prove that none of the users over-estimates$^7$ $U$ under the $\rho^{\text{EST}}$ policy, i.e., the probability of the event $\mathcal{C}(n; U)$ in (22) approaches one as $n \to \infty$, when the thresholds $\xi(n; U)$ for testing against the collision count are chosen appropriately (see line 6 in Algorithm 4). Trivially, we can set $\xi(n; 1) = 1$ since a single collision is enough to indicate that there is more than one user. For any other $k > 1$, we choose functions $\xi$ satisfying

$$\xi(n; k) = \omega(\log n), \quad \forall k > 1. \tag{25}$$

We prove that the above condition ensures that over-estimation does not occur.

$^7$ Note that the $\rho^{\text{EST}}$ policy automatically ensures that all the users do not under-estimate $U$, since it increments $\hat{U}$ based on the collision estimate. This implies that the probability of the event that all the users under-estimate $U$ goes to zero asymptotically.

Recall that $T'(n; \rho^{\text{EST}})$ is the number of slots where any one of the top-$U$ estimated ranks of the channels at some user is wrong under the $\rho^{\text{EST}}$ policy. We show that $\mathbb{E}[T'(n)]$ is $O(\log n)$.

Lemma 5 (Time spent with wrong estimates): The expected number of slots where any of the top-$U$ estimated ranks of the channels at any user is wrong under the $\rho^{\text{EST}}$ policy satisfies

$$\mathbb{E}[T'(n)] \le U \sum_{a=1}^{U} \sum_{b=a+1}^{C} \left[ \frac{8 \log n}{\Delta(a^*, b^*)^2} + 1 + \frac{\pi^2}{3} \right]. \tag{26}$$

Proof: The proof is on the lines of Lemma 3. □

Recall the definition of $\Upsilon(U, U)$ in the previous section, as the maximum time to absorption starting from any initial distribution of the finite-state Markov chain, where the states correspond to different user configurations and the absorbing state corresponds to the collision-free configuration. We now generalize the definition to $\Upsilon(U, k)$, as the time to absorption in a new Markov chain, where the state space is the set of configurations of $U$ users in $k$ channels, and the transition probabilities are defined on similar lines. Note that $\Upsilon(U, k)$ is almost-surely finite when $k \ge U$ and $\infty$ otherwise (since there is no absorbing state in the latter case).

We now bound the maximum value of the collision count under the $\rho^{\text{EST}}$ policy in (27) using $T'(m)$, the total time spent with wrong channel estimates, and $\Upsilon(U, k)$, the time to absorption in the Markov chain. Let $\overset{st}{\le}$ denote the stochastic order for two random variables [25].

Proposition 2: The maximum collision count in (20) over all users under the $\rho^{\text{EST}}$ policy satisfies

$$\max_{j=1,\ldots,U} \Phi_{k,j}(m) \overset{st}{\le} (T'(m) + 1)\, \Upsilon(U, k), \quad \forall m \in \mathbb{N}. \tag{27}$$

Proof: The proof is on the lines of Theorem 3. See Appendix G. □

We now prove that the probability of over-estimation goes to zero asymptotically.

Lemma 6 (No Over-estimation Under $\rho^{\text{EST}}$): For threshold functions $\xi$ satisfying (25), the event $\mathcal{C}(n; U)$ in (22) satisfies

$$\lim_{n \to \infty} \mathbb{P}[\mathcal{C}(n; U)] = 1, \tag{28}$$

and hence, none of the users over-estimates $U$ under the $\rho^{\text{EST}}$ policy.

Proof: See Appendix H. □

We now give the main result of this section that $\rho^{\text{EST}}$ has slightly more than logarithmic regret asymptotically and this depends on the threshold function $\xi^*(n; U)$ in (21).

Theorem 5 (Asymptotic Regret Under $\rho^{\text{EST}}$): With threshold functions $\xi$ satisfying the conditions in (25), the policy $\rho^{\text{EST}}(n, C, g_j(m), \xi)$ in Algorithm 4 satisfies

$$\limsup_{n \to \infty} \frac{R(n; \mu, U, \rho^{\text{EST}})}{\xi^*(n; U)} < \infty. \tag{29}$$

Proof: From Lemma 4 and Lemma 6. □

Hence, the regret under the proposed $\rho^{\text{EST}}$ policy is $O(\xi^*(n; U))$ under a fully decentralized setting without the knowledge of the number of users when $\xi^*(n; U) = \omega(\log n)$. Hence, $O(f(n) \log n)$ regret is achievable for all functions $f(n) \to \infty$ as $n \to \infty$. The question of whether logarithmic regret is possible under an unknown number of users is of interest.

Note the difference between the $\rho^{\text{EST}}$ policy in Algorithm 4 under an unknown number of users and the $\rho^{\text{RAND}}$ policy with a known number of users in Algorithm 3. The regret under $\rho^{\text{EST}}$ is $O(f(n) \log n)$ for any function $f(n) = \omega(1)$, while it is $O(\log n)$ under the $\rho^{\text{RAND}}$ policy. Hence, we are able to quantify the degradation of performance when the number of users is unknown.
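As a concrete instance of the threshold condition, one may take $f(n) = \log\log n$. This particular choice is ours, for illustration only, and is not prescribed by the analysis:

```latex
\[
\xi(n;1) = 1, \qquad \xi(n;k) = \log n \cdot \log\log n \quad \text{for } k > 1 .
\]
\[
\frac{\xi(n;k)}{\log n} = \log\log n \longrightarrow \infty
\quad\Longrightarrow\quad \xi(n;k) = \omega(\log n) .
\]
\[
\xi^*(n;U) = \max_{k=1,\ldots,U} \xi(n;k) = \log n \cdot \log\log n
\quad (U \ge 2,\ n \ge e^{e}),
\qquad
R(n;\mu,U,\rho^{\text{EST}}) = O(\log n \cdot \log\log n) .
\]
```

The last statement follows from the asymptotic regret result for $\rho^{\text{EST}}$ applied to this threshold, and shows how close to logarithmic the regret can be made.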


VI.  Lower  Bound  &  Effect  of  Number  of  Users 


A.  Lower  Bound  For  Distributed  Learning  &  access 

We  have  so  far  designed  distributed  learning  and  access 
policies  with  provable  bounds  on  regret.  We  now  discuss  the 
relative  performance  of  these  policies,  compared  to  the  optimal 
learning  and  access  policies.  This  is  accomplished  by  noting 
a  lower  bound  on  regret  for  any  uniformly-good  policy,  first 
derived  in  S)  for  a  general  class  of  uniformly-good  time- 
division  policies.  We  restate  the  result  below. 

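Theorem 6 (Lower Bound): For any uniformly good distributed learning and access policy $\rho$, the sum regret in (3) satisfies

$$\liminf_{n \to \infty} \frac{R(n; \mu, U, \rho)}{\log n} \ge \sum_{i \in U\text{-worst}} \sum_{j=1}^{U} \frac{\Delta(U^*, i)}{D(\mu_i, \mu_{j^*})}. \tag{30}$$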


The lower bound derived in [9] for centralized learning and access holds for distributed learning and access considered here. But a better lower bound is obtained above by considering the distributed nature of learning. The lower bound for distributed policies is worse than the bound for the centralized policies in (11). This is because each user independently learns the channel availabilities $\mu$ in a distributed policy, whereas sensing decisions from all the users are used for learning in a centralized policy.

Our distributed learning and access policy $\rho^{\text{RAND}}$ matches the lower bound on regret in (15) in the order ($\log n$) but the scaling factors are different. It is not clear if the regret lower bound in (30) can be achieved by any policy under no explicit information exchange, and this is a topic for future investigation.


B.  Behavior  with  Number  of  Users 

We  have  so  far  analyzed  the  sum  regret  under  our  policies 
under  a  fixed  number  of  users  U.  We  now  analyze  the  behavior 
of  regret  growth  as  U  increases  while  keeping  the  number  of 
channels  C  >  U  fixed. 


Theorem 7 (Varying Number of Users): When the number of channels $C$ is fixed and the number of users $U < C$ is varied, the sum regret under centralized learning and access $\rho^{\text{CENT}}$ in (12) decreases as $U$ increases, while the upper bound on the sum regret under $\rho^{\text{RAND}}$ in (15) monotonically increases with $U$.

Proof: The proof involves analysis of (12) and (15). To prove that the sum regret under centralized learning and access in (12) decreases with the number of users $U$, it suffices to show that for $i \in U$-worst channel,

$$\frac{\Delta(U^*, i)}{D(\mu_i, \mu_{U^*})}$$

decreases as $U$ increases. Note that $\mu(U^*)$ and $D(\mu_i, \mu_{U^*})$ decrease as $U$ increases. Hence, it suffices to show that

$$\frac{\mu(U^*)}{D(\mu_i, \mu_{U^*})}$$

decreases with $U$. This is true since its derivative with respect to $U$ is negative.

For the upper bound on regret under $\rho^{\text{RAND}}$ in (15), when $U$ is increased, the number of $U$-worst channels decreases and hence, the first term in (15) decreases. However, the second term consisting of collisions $M(n)$ increases to a far greater extent. □

Note that the above result is for the upper bound on regret under the $\rho^{\text{RAND}}$ policy and not the regret itself. Simulations in Section VII reveal that the actual regret also increases with $U$. Under the centralized scheme $\rho^{\text{CENT}}$, as $U$ increases, the number of $U$-worst channels decreases. Hence, the regret decreases, since there are fewer possibilities of making bad decisions. However, for distributed schemes, although this effect exists, it is far outweighed by the increase in regret due to the increase in collisions among the $U$ users.

In contrast, the distributed lower bound in (30) displays anomalous behavior with $U$ since it fails to account for collisions among the users. Here, as $U$ increases there are two competing effects: a decrease in regret due to a decrease in the number of $U$-worst channels and an increase in regret due to an increase in the number of users visiting these $U$-worst channels.


VII.  Numerical  Results 

We  present  simulations  that  vary  the  schemes  and  the 
number  of  users  and  channels  to  verify  the  performance  of 
the  algorithms  detailed  earlier.  We  consider  C=  9  channels 
(or  a  subset  of  them  when  the  number  of  channels  is  varying) 
with  probabilities  of  availability  characterized  by  Bernoulli 
distributions  with  evenly  spaced  parameters  ranging  from  0.1 
to  0.9. 
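The simulation setup described above can be written down in a few lines. The sketch below is our own illustration of the channel model only (evenly spaced Bernoulli availabilities); the helper names and the seed are hypothetical, and no claim is made that this matches the authors' simulation code.

```python
import random

C = 9
# Evenly spaced Bernoulli parameters 0.1, 0.2, ..., 0.9
mu = [round(0.1 * k, 1) for k in range(1, C + 1)]

def best_channels(mu, num_users):
    """Indices of the num_users channels with the highest availability."""
    order = sorted(range(len(mu)), key=lambda i: mu[i], reverse=True)
    return order[:num_users]

def sample_availability(mu, rng):
    """One slot of independent Bernoulli channel states (1 = free)."""
    return [1 if rng.random() < p else 0 for p in mu]

rng = random.Random(1)
u_best = best_channels(mu, 4)
ideal_rate = sum(mu[i] for i in u_best)   # per-slot throughput with known mu
print(u_best, ideal_rate)
print(sample_availability(mu, rng))
```

With $U = 4$ users, the $U$-best channels are those with availabilities 0.9, 0.8, 0.7 and 0.6, so the ideal system throughput under perfect knowledge is about 3 successful transmissions per slot; regret measures the shortfall from $n$ times this rate.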

Comparison of Different Schemes: Fig. 2a compares the regret under the centralized and random allocation schemes in a scenario with $U = 4$ cognitive users vying for access to the $C = 9$ channels. The theoretical lower bounds for the regret in the centralized case from Theorem 1 and the distributed case from Theorem 6 are also plotted. The upper bound on the random allocation scheme from Theorem 4 is not plotted here, since the bound is loose, especially as the number of users $U$ increases. Finding tight upper bounds is a subject of future study.

Fig. 2. Simulation Results. Probability of Availability $\mu = [0.1, 0.2, \ldots, 0.9]$.
(a) Normalized regret $R(n)/\log n$ vs. $n$ slots. $U = 4$ users, $C = 9$ channels.
(b) Normalized regret $R(n)/\log n$ vs. $n$ slots. $U = 4$ users, $C = 9$ channels.
(c) No. of collisions $M(n)$ vs. $n$ slots. $U = 4$ users, $C = 9$ channels, $\rho^{\text{RAND}}$ policy.

Fig. 3. Simulation Results. Probability of Availability $\mu = [0.1, 0.2, \ldots, 0.9]$.
(a) Normalized regret vs. $U$ users. $C = 9$ channels, $n = 2500$ slots.
(b) Normalized regret vs. $C$ channels. $U = 2$ users, $n = 2500$ slots.
(c) Normalized regret vs. $U$ users. User-channel ratio $U/C = 0.5$, $n = 2500$ slots.

As expected, centralized allocation has the least regret. Another important observation is the gap between the lower bounds on the regret and the actual regret in both the distributed and the centralized cases. In the centralized scenario, this is simply due to using the $g^{\text{MEAN}}$ statistic in (14) instead of the optimal $g^{\text{OPT}}$ statistic. However, in the distributed case, there is an additional gap since we do not account for collisions among the users. Hence, the schemes under consideration are $O(\log n)$ and achieve order optimality although they are not optimal in the scaling constant.

Performance with Varying $U$ and $C$: Fig. 3a explores the impact of increasing the number of secondary users $U$ on the regret experienced by the different policies while fixing the number of channels $C$. With increasing $U$, the regret decreases for the centralized schemes and increases for the distributed schemes, as predicted in Theorem 7. The monotonic increase of regret under random allocation $\rho^{\text{RAND}}$ is a result of the increase in the collisions as $U$ increases. The monotonic decrease in the centralized case occurs because, as the number of users increases, the number of $U$-worst channels decreases, resulting in lower regret. Also, the lower bound for the distributed case in (30) initially increases and then decreases with $U$. This is because as $U$ increases there are two competing effects: a decrease in regret due to a decrease in the number of $U$-worst channels and an increase in regret due to an increase in the number of users visiting these $U$-worst channels.


Fig. 3b evaluates the performance of the different algorithms as the number of channels $C$ is varied while fixing the number of users $U$. The probability of availability of each additional channel is set higher than those already present. Here, the regret monotonically increases with $C$ in all cases. When the number of channels increases along with the quality of the channels, the regret increases as a result of an increase in the number of $U$-worst channels as well as the increasing gap in quality between the $U$-best and $U$-worst channels.

Also, the situation where the ratio $U/C$ is fixed to be 0.5 and both the number of users and channels along with their quality increase is considered in Fig. 3c. As the number of users increases, the regret increases as the number of channels $C$ and their quality are both increasing. Once again, this is in agreement with theory as the number of $U$-worst channels increases as $U$ and $C$ increase while keeping $U/C$ fixed.

Collisions and Learning: Fig. 2c verifies the logarithmic nature of the number of collisions under the random allocation scheme $\rho^{\text{RAND}}$. Additionally, we also plot the number of collisions under $\rho^{\text{RAND}}$ in the ideal scenario when the channel availability statistics $\mu$ are known, to see the effect of learning on the number of collisions. The low value of the number of collisions obtained under known channel parameters in the simulations is in agreement with theoretical predictions, analyzed as $U\,\mathbb{E}[\Upsilon(U, U)]$ in Lemma 2. As the number of slots $n$ increases, the gap between the number of collisions under the known and unknown parameters increases since the former converges to a finite constant while the latter grows as $O(\log n)$. The logarithmic behavior of the cumulative number of collisions can be inferred from Fig. 2a. However, the curve in Fig. 2c for the unknown parameter case appears linear in $n$ due to the small value of $n$.

Fig. 4. Simulation Results. Probability of Availability $\mu = [0.1, 0.2, \ldots, 0.9]$. No. of slots where user has best channel vs. user. $U = 4$, $C = 9$, $n = 2500$ slots, 1000 runs, $\rho^{\text{RAND}}$.

Difference between $g^{\text{OPT}}$ and $g^{\text{MEAN}}$: Since the statistic $g^{\text{MEAN}}$ used in the schemes in this paper differs from the optimal statistic $g^{\text{OPT}}$, a simulation is done to compare the performance of the schemes under both the statistics. As expected, in Fig. 2b the optimal scheme has better performance. However, the use of $g^{\text{MEAN}}$ enables us to provide finite-time bounds, as described earlier.

Fairness: One of the important features of $\rho^{\text{RAND}}$ is that it does not favor any one user over another. Each user has an equal chance of settling down in any one of the $U$-best channels. Fig. 4 evaluates the fairness characteristics of $\rho^{\text{RAND}}$. The simulation assumes $U = 4$ cognitive users vying for access to $C = 9$ channels. The graph depicts which user asymptotically gets the best channel over 1000 runs of the random allocation scheme. As can be seen, each user has approximately the same frequency of being allotted the best channel, indicating that the random allocation scheme is indeed fair.


VIII.  Conclusion 

In  this  paper,  we  proposed  novel  policies  for  distributed 
learning  of  channel  availability  statistics  and  channel  access 
of  multiple  secondary  users  in  a  cognitive  network.  The  first 
policy  assumed  that  the  number  of  secondary  users  in  the 
network  is  known,  while  the  second  policy  removed  this 
requirement.  We  provide  provable  guarantees  for  our  policies 
in  terms  of  sum  regret.  Combined  with  the  lower  bound  on 
regret  for  any  uniformly-good  learning  and  access  policy,  our 
first  policy  achieves  order-optimal  regret  while  our  second 
policy  is  also  nearly  order  optimal.  Our  analysis  in  this  paper 
provides  insights  on  incorporating  learning  and  distributed 
medium  access  control  in  a  practical  cognitive  network. 

The results of this paper open up an interesting array of
problems for future investigation. Our assumptions of an i.i.d.
model for primary user transmissions and perfect sensing at
the secondary users need to be relaxed. Our policy allows for
an unknown but fixed number of secondary users, and it is of
interest to incorporate users dynamically entering and leaving
the system. Moreover, our model ignores dynamic traffic at
the secondary nodes, and extension to a queueing-theoretic
formulation is desirable.
Appendix


A. Proof of Theorem 2

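The result in (13) involves extending the results of [?, Thm. 1]. Define $T_i(n) := \sum_{j=1}^{U} T_{i,j}(n)$ as the number of times a channel $i$ is sensed in $n$ rounds for all users. We will show that

\[
\mathbb{E}[T_i(n)] \le \sum_{k \in U\text{-best}} \left[ \frac{8 \log n}{\Delta(k^*, i)^2} + 1 + \frac{\pi^2}{3} \right], \quad \forall\, i \in U\text{-worst}. \tag{31}
\]

We have

\[
\begin{aligned}
\mathbb{P}[\text{Tx. in } i \text{ in } n\text{th slot}] &= \mathbb{P}[g(U^*; n) < g(i; n)] \\
&= \mathbb{P}[A(i; n) \cap (g(U^*; n) < g(i; n))] \\
&\quad + \mathbb{P}[A^c(i; n) \cap (g(U^*; n) < g(i; n))],
\end{aligned}
\]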

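where

\[
A(i; n) := \bigcup_{k \in U\text{-best}} \big( g(k; n) < g(i; n) \big)
\]

is the event that at least one of the $U$-best channels has a $g$-statistic less than that of channel $i$. Hence, from the union bound, we have

\[
\mathbb{P}[A(i; n)] \le \sum_{k \in U\text{-best}} \mathbb{P}[g(k; n) < g(i; n)].
\]

We have, for $C > U$,

\[
\mathbb{P}[A^c(i; n) \cap (g(U^*; n) < g(i; n))] = 0.
\]

Hence,

\[
\mathbb{P}[\text{Tx. in } i \text{ in } n\text{th round}] \le \sum_{k \in U\text{-best}} \mathbb{P}[g(k; n) < g(i; n)].
\]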


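On the lines of [?, Thm. 1], we have, for all $k, i$ such that $k$ is $U$-best and $i$ is $U$-worst,

\[
\sum_{t=1}^{n} \mathbb{P}[g(k; t) < g(i; t)] \le \frac{8 \log n}{\Delta(k^*, i)^2} + 1 + \frac{\pi^2}{3}.
\]

Hence, we have (31). For the bound on regret, we can break $R$ in (?) into two terms:

\[
\begin{aligned}
R(n; \mu, U, \rho^{\mathrm{PRE}}) &= \sum_{i \in U\text{-worst}} \left[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \right] \mathbb{E}[T_i(n)] \\
&\quad + \sum_{i \in U\text{-best}} \left[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \right] \mathbb{E}[T_i(n)].
\end{aligned}
\]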
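For the second term, we have

\[
\sum_{i \in U\text{-best}} \left[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \right] \mathbb{E}[T_i(n)]
\le \sum_{i \in U\text{-best}} \left[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \right] \mathbb{E}[T^*(n)] = 0,
\]

where $T^*(n) := \max_{i \in U\text{-best}} T_i(n)$. The right-hand side vanishes because the gaps $\Delta(l^*, i) = \mu(l^*) - \mu(i)$ sum to zero over $l = 1, \ldots, U$ and $i \in U$-best. Hence, we have the bound. □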


B. Proof of Proposition 1

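For convenience, let $T_i(n) := \sum_{j=1}^{U} T_{i,j}(n)$ and $V_i(n) := \sum_{j=1}^{U} V_{i,j}(n)$. Note that $\sum_{i=1}^{C} T_i(n) = nU$, since each user selects one channel for sensing in each slot and there are $U$ users. From (3),

\[
\begin{aligned}
R(n) &= n \sum_{i=1}^{U} \mu(i^*) - \sum_{i=1}^{C} \mu(i)\, \mathbb{E}[V_i(n)] \\
&\le \sum_{i \in U\text{-best}} \mu(i) \big( n - \mathbb{E}[V_i(n)] \big) \\
&\le \mu(1^*) \Big( nU - \sum_{i \in U\text{-best}} \mathbb{E}[V_i(n)] \Big) \qquad (32) \\
&= \mu(1^*) \Big( \mathbb{E}[M(n)] + \sum_{i \in U\text{-worst}} \mathbb{E}[T_i(n)] \Big), \qquad (33)
\end{aligned}
\]

where Eqn. (32) uses the fact that $V_i(n) \le n$, since the total number of sole occupancies of channel $i$ in $n$ slots is at most $n$, and Eqn. (33) uses the fact that $M(n) = \sum_{i \in U\text{-best}} (T_i(n) - V_i(n))$.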

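For the lower bound, since each user selects one channel for sensing in each slot, $\sum_{i=1}^{C} \sum_{j=1}^{U} T_{i,j}(n) = nU$. Now $T_{i,j}(n) \ge V_{i,j}(n)$, and hence

\[
\begin{aligned}
R(n; \mu, U, \rho) &\ge \frac{1}{U} \sum_{k=1}^{U} \sum_{j=1}^{U} \sum_{i=1}^{C} \Delta(k^*, i)\, \mathbb{E}[T_{i,j}(n)] \\
&\ge \sum_{j=1}^{U} \sum_{i \in U\text{-worst}} \Delta(U^*, i)\, \mathbb{E}[T_{i,j}(n)].
\end{aligned}
\]

□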
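As a sanity check on the two bounds, the sketch below evaluates the regret expression and both sides of the sandwich for hand-picked sensing counts. The numbers (3 channels, 2 users, n = 10 slots, the availabilities in `mu`) are made up for illustration, and fixed counts stand in for the expectations of T_{i,j}(n) and V_{i,j}(n); the function name is likewise an assumption.

```python
def regret_sandwich(mu, T, V, n):
    """Return (lower bound, regret, upper bound) for given counts.

    mu[i]   : availability of channel i, sorted in decreasing order
    T[i][j] : slots in which user j sensed channel i
    V[i][j] : slots in which user j was the sole user of channel i
    """
    U = len(T[0])
    best, worst = range(U), range(U, len(mu))
    T_i = [sum(row) for row in T]
    V_i = [sum(row) for row in V]
    assert sum(T_i) == n * U  # each user senses one channel per slot

    regret = n * sum(mu[k] for k in best) - sum(m * v for m, v in zip(mu, V_i))
    collisions = sum(T_i[i] - V_i[i] for i in best)  # plays the role of M(n)
    upper = mu[0] * (collisions + sum(T_i[i] for i in worst))
    lower = sum((mu[U - 1] - mu[i]) * T[i][j] for i in worst for j in range(U))
    return lower, regret, upper

mu = [0.9, 0.6, 0.3]
T = [[6, 3], [3, 5], [1, 2]]
V = [[5, 2], [2, 4], [1, 2]]
lower, regret, upper = regret_sandwich(mu, T, V, n=10)
print(lower, regret, upper)
```

The upper bound charges every collision in a U-best channel and every visit to a U-worst channel at the rate of the best channel, while the lower bound only counts visits to U-worst channels, weighted by their gap to the U-th best channel.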


C. Proof of Lemma 2

Although we could directly compute the time to absorption of the Markov chain, we give a simple bound on $\mathbb{E}[\Upsilon(U, U)]$ by considering an i.i.d. process over the same state space. We term this process a genie-aided modification of the random allocation scheme, since it can be realized as follows: in each slot, a genie checks whether any collision occurred, in which case a new random variable is drawn from Unif($U$) by all users. This is in contrast to the original random allocation scheme, where a new random variable is drawn only when the particular user experiences a collision. Note that for $U = 2$ users, the two scenarios coincide.

For the genie-aided scheme, the expected number of slots to hit orthogonality is just the mean of the geometric distribution

\[
\sum_{k=1}^{\infty} k (1 - p)^k p = \frac{1 - p}{p} < \infty, \tag{34}
\]

where $p$ is the probability of having an orthogonal configuration in a slot. This is in fact the reciprocal of the number of compositions of $U$ [24, Thm. 5.1], given by

\[
p = \binom{2U - 1}{U}^{-1}. \tag{35}
\]

The above expression is nothing but the reciprocal of the number of ways $U$ identical balls (users) can be placed in $U$ different bins (channels): there are $2U - 1$ possible positions to form $U$ partitions of the balls.

Now, for the random allocation scheme without the genie, any user not experiencing a collision does not draw a new variable from Unif($U$). Hence, the number of possible configurations in any slot is lower than under the genie-aided scheme. Since there is only one configuration satisfying orthogonality (all users are identical for this analysis), the probability of orthogonality increases in the absence of the genie and is at least (35). Hence, the number of slots to reach orthogonality without the genie is at most (34). Since in any slot at most $U$ collisions occur, (?) holds. □


D. Proof of Lemma 3

Let $c_{n,m} := \sqrt{\dfrac{2 \log n}{m}}$.

Case 1: Consider $U = C = 2$ first. On lines of [?, Thm. 1], for a positive integer $l$,

\[
\begin{aligned}
T'(n) &\le l + \sum_{t=2}^{n} \mathbb{I}\big[ g_j(1^*; t-1) \le g_j(2^*; t-1),\ T'(t-1) \ge l \big] \\
&\le l + \sum_{t=1}^{\infty} \sum_{m+h=l}^{t} \mathbb{I}\big( \bar{X}_{1^*,j}(h) + c_{t,h} \le \bar{X}_{2^*,j}(m) + c_{t,m} \big).
\end{aligned}
\]

The above event is implied by

\[
\bar{X}_{1^*,j}(h) + c_{t,h} \le \bar{X}_{2^*,j}(m) + c_{t,h+m},
\]

since $c_{t,m} > c_{t,h+m}$.

The above event implies at least one of the following events and hence, we can use the union bound:

\[
\begin{aligned}
\bar{X}_{1^*,j}(h) &\le \mu_{1^*} - c_{t,h}, \\
\bar{X}_{2^*,j}(m) &\ge \mu_{2^*} + c_{t,h+m}, \\
\mu_{1^*} &< \mu_{2^*} + 2 c_{t,h+m}.
\end{aligned}
\]

From the Chernoff-Hoeffding bound,

\[
\begin{aligned}
\mathbb{P}[\bar{X}_{1^*,j}(h) \le \mu_{1^*} - c_{t,h}] &\le t^{-4}, \\
\mathbb{P}[\bar{X}_{2^*,j}(m) \ge \mu_{2^*} + c_{t,h+m}] &\le t^{-4},
\end{aligned}
\]

and the event that $\mu_{1^*} < \mu_{2^*} + 2 c_{t,h+m}$ implies that

\[
h + m < \frac{8 \log t}{\Delta_{1^*,2^*}^2}.
\]

Since

\[
\sum_{t=1}^{\infty} \sum_{m=1}^{t} \sum_{h=1}^{t} 2 t^{-4} = \frac{\pi^2}{3},
\]

we have

\[
\mathbb{E}[T'(n; U = C = 2)] \le \frac{8 \log n}{\Delta_{1^*,2^*}^2} + 1 + \frac{\pi^2}{3}.
\]
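The constant π²/3 in the bound for U = C = 2 arises from a triple sum of 2t^{-4} over t ≥ 1 and 1 ≤ m, h ≤ t: the summand does not depend on m or h, so the two inner sums contribute a factor t², leaving the sum of 2/t², which equals π²/3. A quick numerical check follows; the truncation point and function names are arbitrary choices for illustration.

```python
import math

def confidence_width(n, m):
    """c_{n,m} = sqrt(2 log n / m), the exploration term in the g-statistic."""
    return math.sqrt(2.0 * math.log(n) / m)

def truncated_triple_sum(t_max):
    """Sum of 2 t^-4 over 1 <= m, h <= t and 1 <= t <= t_max.

    The summand does not depend on m or h, so each t contributes
    t * t * 2 t^-4 = 2 / t^2.
    """
    return sum(2.0 / (t * t) for t in range(1, t_max + 1))

if __name__ == "__main__":
    approx = truncated_triple_sum(100_000)
    print(approx, math.pi ** 2 / 3)
    # The width shrinks as a channel is sampled more often.
    print(confidence_width(1000, 10), confidence_width(1000, 100))
```

The truncated sum approaches π²/3 from below, with an error of order 2/t_max.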


Case 2: For $\min(U, C) > 2$, we have

\[
T'(n) \le U \sum_{a=1}^{U} \sum_{b=a+1}^{C} \sum_{m=1}^{n} \mathbb{I}\big[ g_j(a^*; m) < g_j(b^*; m) \big],
\]

where $a^*$ and $b^*$ represent channels with the $a$th and $b$th highest availabilities. On lines of the result for $U = C = 2$, we can show that

\[
\sum_{m=1}^{n} \mathbb{E}\, \mathbb{I}\big[ g_j(a^*; m) < g_j(b^*; m) \big] \le \frac{8 \log n}{\Delta_{a^*,b^*}^2} + 1 + \frac{\pi^2}{3}.
\]

Hence, (18) holds. □


E. Proof of Theorem 3

Define the good event as all users having the correct top-$U$ order of the $g$-statistics, given by

\[
S(n) := \bigcap_{j=1}^{U} \{ \text{Top-}U \text{ entries of } g_j(n) \text{ are same as in } \mu \}.
\]

The number of slots under the bad event is

\[
\sum_{m=1}^{n} \mathbb{I}[S^c(m)] = T'(n),
\]

by definition of $T'(n)$. In each slot, either a good or a bad event occurs. Let $\gamma$ be the total number of collisions in $U$-best channels between two bad events, i.e., under a run of good events. In this case, all the users have the correct top-$U$ ranks of channels and hence,

\[
\mathbb{E}[\gamma \mid S(n)] \le U\, \mathbb{E}[\Upsilon(U, U)] < \infty,
\]

where $\mathbb{E}[\Upsilon(U, U)]$ is given by (17). Hence, each transition from the bad to the good state results in at most $U\, \mathbb{E}[\Upsilon(U, U)]$ expected number of collisions in the $U$-best channels. The expected number of collisions under the bad event is at most $U\, \mathbb{E}[T'(n)]$. Hence, (19) holds. □


F. Proof of Lemma 4

Under $\mathcal{C}(n; U)$, a $U$-worst channel is sensed only if it is mistaken to be a $U$-best channel. Hence, on lines of Lemma 3,

\[
\mathbb{E}[T_{i,j}(n) \mid \mathcal{C}(n; U)] = O(\log n), \quad \forall\, i \in U\text{-worst},\ j = 1, \ldots, U.
\]

For the number of collisions $M(n)$ in the $U$-best channels, there can be at most $U \sum_{k=1}^{a} \xi(n; k)$ collisions in the $U$-best channels, where $a := \max_{j=1,\ldots,U} \hat{U}_j$ is the maximum estimate of the number of users. Conditioned on $\mathcal{C}(n; U)$, $a \le U$, and hence, we have (24). □


G. Proof of Proposition 2

Define the good event as all users having the correct top-$U$ order, given by

\[
S(n) := \bigcap_{j=1}^{U} \{ \text{Top-}U \text{ entries of } g_j(n) \text{ are same as in } \mu \}.
\]

The number of slots under the bad event is

\[
\sum_{m=1}^{n} \mathbb{I}[S^c(m)] = T'(n),
\]

by definition of $T'(n)$. In each slot, either a good or a bad event occurs. Let $\gamma$ be the total number of collisions in $k$-best channels between two bad events, i.e., under a run of good events. In this case, all the users have the correct top-$U$ ranks of channels and hence,

\[
\gamma \mid S(n) \le U\, \Upsilon(U, k).
\]

The number of collisions under the bad event is at most $T'(n)$. Hence, (27) holds. □


H. Proof of Lemma 6

We are interested in

\[
\begin{aligned}
\mathbb{P}[\mathcal{C}^c(n); U] &= \mathbb{P}\Big[ \bigcup_{j=1}^{U} \hat{U}_j(n) > U \Big] \\
&= \mathbb{P}\Big[ \bigcup_{m=1}^{n} \bigcup_{j=1}^{U} \{ \Phi_{U,j}(m) > \xi(n; U) \} \Big] \\
&= \mathbb{P}\Big[ \max_{j=1,\ldots,U} \Phi_{U,j}(n) > \xi(n; U) \Big],
\end{aligned}
\]

where $\Phi$ is given by (20). For $U = 1$, we have $\mathbb{P}[\mathcal{C}^c(n); U] = 0$ since no collisions occur.

Using (27) in Proposition 2,

\[
\begin{aligned}
\mathbb{P}\Big[ \max_{j=1,\ldots,U} \Phi_{k,j}(n) > \xi(n; k) \Big]
&\le \mathbb{P}\big[ k\, \Upsilon(U, k) (T'(n) + 1) > \xi(n; k) \big] \\
&\le \mathbb{P}\Big[ k (T'(n) + 1) > \frac{\xi(n; k)}{a_n} \Big] + \mathbb{P}[\Upsilon(U, k) > a_n] \\
&\le \frac{k\, a_n (\mathbb{E}[T'(n)] + 1)}{\xi(n; k)} + \mathbb{P}[\Upsilon(U, k) > a_n],
\end{aligned}
\tag{36}
\]

using the Markov inequality. By choosing $a_n = \omega(1)$, the second term in (36), viz., $\mathbb{P}[\Upsilon(U, k) > a_n] \to 0$ as $n \to \infty$, for $k \ge U$. For the first term, from (26) in Lemma 5, $\mathbb{E}[T'(n)] = O(\log n)$. Hence, by choosing $a_n = o(\xi^*(n; k) / \log n)$, the first term decays to zero. Since $\xi^*(n; U) = \omega(\log n)$, we can choose $a_n$ satisfying both the conditions. By letting $k = U$ in (36), we have $\mathbb{P}[\mathcal{C}^c(n); U] \to 0$ as $n \to \infty$, and (28) holds. □
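In the proof of Lemma 6, the sequence a_n must grow without bound while staying of smaller order than the threshold divided by log n. The sketch below illustrates one way both conditions can hold at once. The threshold (log n)² is a hypothetical example of a function growing faster than log n, not the choice made in the paper, and taking a_n as the square root of the ratio is likewise only one convenient option.

```python
import math

def xi_star(n):
    """Hypothetical threshold growing faster than log n (an assumption)."""
    return math.log(n) ** 2

def a_n(n):
    """Square root of xi_star(n) / log n: unbounded, yet of smaller order."""
    return math.sqrt(xi_star(n) / math.log(n))

def first_term_scale(n):
    """a_n * log n / xi_star(n), the order of the first term in the bound."""
    return a_n(n) * math.log(n) / xi_star(n)

if __name__ == "__main__":
    for n in (10**2, 10**4, 10**8, 10**16):
        print(n, round(a_n(n), 3), round(first_term_scale(n), 3))
```

With this choice a_n equals the square root of log n and the first term scales as its reciprocal, so one quantity diverges while the other vanishes, which is all the argument requires.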